## Introduction

Prosper Marketplace is America’s first peer-to-peer lending marketplace, with over $7 billion in funded loans. Borrowers request personal loans on Prosper, and investors (individual or institutional) can fund anywhere from $2,000 to $35,000 per loan request. Investors can consider borrowers’ credit scores, ratings, and histories, as well as the category of the loan. Prosper handles the servicing of the loan and collects and distributes borrower payments and interest back to the loan investors.

A personal loan is an unsecured loan, typically from $1,000 to $100,000, with a fixed or variable interest rate. It can be used to make a large purchase (a medical procedure, home improvement, an engagement ring, a wedding, a baby, or another major life event) or to consolidate debt such as credit card balances.

Prosper has a transaction-based business model, in which the company collects revenue by taking a fee on its customers’ transactions. Borrowers who receive a loan pay an origination fee of 1.00% to 5.00%, depending on the borrower’s Prosper Rating, and investors pay a 1% annual servicing fee. Every loan application is assigned a Prosper Rating, a proprietary rating system that allows for consistency in the evaluation of applicants. Potential investors use Prosper Ratings to help decide whether to commit to invest in a loan listing.

I chose the Prosper loan data because I am fascinated by the volume of personal loans Prosper offers and curious about the market it targets. How does Prosper set interest rates? What makes investors want to invest in these short-term notes? What kinds of risks and returns are associated with these loans? As a borrower’s status changes (employment, marital status, or spending habits, for example), does the risk of default increase? I expect to raise more questions as I go deeper into the analysis.

Data Set of Choice:

ProsperLoanData is one of the data sets suggested by Udacity for the R analysis project.

* [Available at:](https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv)

* [Variable definitions available at:](https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0)

Prepare environment: Loading appropriate libraries to conduct the analysis.

Load data: Produce high level overview of data

Review of initial exploration

=========================================================================================================

Since Prosper claims to have originated over $9 billion in consumer loans to date, I want to see how the loan counts are distributed over the years by LoanOriginationQuarter.

The plot did not come out as I wanted because it is sorted by quarter and then by year. I had to change the order of the factor levels by specifying the order explicitly. After applying this change, ggplot understands the order in which to plot LoanOriginationQuarter on the x-axis.
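The reordering described above can be sketched as follows. This is a minimal illustration, assuming the quarter labels look like "Q1 2006"; the sample vector and the helper name `order_quarters` are made up for the example and are not the code used in this report.

```r
# Hypothetical sample of LoanOriginationQuarter labels in "Qn YYYY" form.
quarters <- c("Q1 2007", "Q4 2006", "Q2 2006", "Q1 2006", "Q3 2006", "Q4 2006")

# Return the distinct labels sorted by year first, then by quarter.
order_quarters <- function(x) {
  labels  <- unique(as.character(x))
  quarter <- as.integer(substr(labels, 2, 2))   # digit after "Q"
  year    <- as.integer(substr(labels, 4, 7))   # four-digit year
  labels[order(year, quarter)]
}

# A factor with explicit levels keeps this order on a plot axis.
quarter_factor <- factor(quarters, levels = order_quarters(quarters))
levels(quarter_factor)
```

Because ggplot2 plots a factor in the order of its levels, mapping a factor built this way to the x aesthetic gives a chronological axis instead of an alphabetical one.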

Bivariate Analysis of Borrower’s Revolving credit card balance

=========================================================================================================

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3121    8549   17600   19520 1436000    7604

=========================================================================================================

* The majority of borrowers’ revolving credit card balances are roughly between 3,000 and 23,000. The distribution is right-skewed, with a long tail to the right.

After applying scale_x_log10(), the revolving credit balance looks approximately normally distributed.
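To illustrate why a log10 scale helps here, the sketch below compares the skewness of a simulated right-skewed variable before and after the transformation. The data are drawn from a log-normal distribution as a stand-in for RevolvingCreditBalance, and the `skewness` helper is defined only for this example; neither comes from the Prosper data.

```r
set.seed(42)

# Simulated right-skewed balances (log-normal); not the Prosper values.
balance <- rlnorm(5000, meanlog = 9, sdlog = 1)

# Simple moment-based skewness: close to 0 for a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(balance)          # strongly positive: long right tail
skew_log <- skewness(log10(balance))   # near zero after the transformation

c(raw = skew_raw, log10 = skew_log)
```

One caveat: the real balances include zeros (the minimum in the summary above is 0), and log10(0) is not finite, so those rows fall off a log-scaled axis.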

Bivariate Analysis of Borrower’s Debt Consolidation

## # A tibble: 2 x 4
##   `ListingCategory == 1` loan_amount_mean loan_amount_median     n
##   <lgl>                             <dbl>              <dbl> <int>
## 1 F                                  6690               4700 55629
## 2 T                                  9908               9500 58308

=========================================================================================================

## loanData$ListingCategory == 1: FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   103.9   172.8   229.1   307.4  2179.0 
## -------------------------------------------------------- 
## loanData$ListingCategory == 1: TRUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   166.5   283.0   313.8   410.0  2252.0

=========================================================================================================

* Most borrowers prefer a 36-month loan over a 60-month or 12-month loan. The plot shows that loans used for debt consolidation are mostly $10,000 or under.

Prosper’s interest rates are much higher than revolving credit card rates; however, when the debts are consolidated into one lower monthly payment, it seems to work for borrowers.

For borrowers who use the Prosper loan to consolidate their debt, the median monthly payment is 283.0, the mean is 313.8, and the max is 2,252.
For the rest, the median monthly payment is 172.8, the mean is 229.1, and the max is 2,179. Also, notice the break across the plot where no loans were issued at any loan amount (the white line at the top of the plot). It is related to the 2008 period we talked about earlier.
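A grouped summary like the one quoted above can be sketched in base R as follows. The toy data frame stands in for loanData and its values are invented; the column name `is_debt_consolidation` is introduced only for this illustration.

```r
# Toy data standing in for loanData; values are invented for illustration.
loans <- data.frame(
  ListingCategory    = c(1, 1, 1, 7, 2, 0),
  MonthlyLoanPayment = c(280, 310, 400, 150, 170, 230)
)

# TRUE when the listing category is 1 (debt consolidation in this data set).
loans$is_debt_consolidation <- loans$ListingCategory == 1

# One row per group, with several statistics of the monthly payment.
payment_summary <- aggregate(
  MonthlyLoanPayment ~ is_debt_consolidation,
  data = loans,
  FUN  = function(x) c(median = median(x), mean = mean(x), max = max(x), n = length(x))
)
payment_summary
```

Keeping the count `n` next to the median and mean makes it easier to judge how much weight each group's numbers deserve.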

The distribution is right-skewed, with a long tail on the right.

Debt Consolidation and Prosper loan monthly payment Analysis

Transform the above plot to focus on the majority of loans, with monthly payments < 1000 and loan amounts < 25000.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0

=========================================================================================================

The majority of the loans have a monthly payment of less than $1,000.

Debt Consolidation Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   114.0   271.0   398.3   525.0 14980.0

=========================================================================================================

The open revolving credit card payment distribution is right-skewed, with a very long tail to the right. I can’t tell much from looking at this plot.

The plot to the right is a transformation of the original one, using scale_x_log10().

Employment Status, Prosper Loan Amount

## # A tibble: 9 x 4
##   EmploymentStatus mean_monthly_Income median_monthly_Income     n
##   <fct>                          <dbl>                 <dbl> <int>
## 1 ""                              5165                  4083  2255
## 2 Employed                        6139                  5205 67322
## 3 Full-time                       5043                  4250 26355
## 4 Not available                   4555                  3583  5347
## 5 Not employed                     197                     0   835
## 6 Other                           3568                  3167  3806
## 7 Part-time                       1640                  1379  1088
## 8 Retired                         2987                  2617   795
## 9 Self-employed                   6338                  4333  6134
## 
##  Pearson's product-moment correlation
## 
## data:  loanData$ProsperScore and loanData$BorrowerRate
## t = -248.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6536072 -0.6458311
## sample estimates:
##        cor 
## -0.6497361

=========================================================================================================

Looking at the summary grouped by employment status and borrowers’ stated monthly income, it seems this market deals with borrowers at the high-risk end of the spectrum.

Prosper provided loans to borrowers with an employment status of “Not available”, “Not employed”, “Other”, or none at all (blank).

It seems Prosper has very lenient policies compared with the guidelines imposed by the traditional banking industry. I believe this is because the investors, not Prosper, bear the risk. I want to investigate this further. Investors are attracted to the higher returns which, of course, come with higher risks. So, let’s look at the analysis from the investor’s perspective.

Is there any correlation between Borrower’s rates and Prosper’s ratings?

To answer this question, I used cor.test to test the relationship between the two variables mentioned above.

The results are listed below:

Pearson's product-moment correlation

I think geom_jitter shows the grouping of Prosper scores in relation to borrower interest rates nicely.

There is a negative correlation of -0.65 between the borrower’s rate and the Prosper score: the higher the borrower’s rate, the lower the Prosper score (0 to 10).
This is clearly shown in the plot and is evident in the smooth red line.
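For readers who want to see the shape of such a test, here is a minimal sketch of cor.test on simulated data. The `score` and `rate` vectors are invented stand-ins for ProsperScore and BorrowerRate, generated so that the rate falls as the score rises; the numbers are not the Prosper results.

```r
set.seed(1)

# Simulated stand-ins: a score from 1 to 10 and a rate that falls as the score rises.
score <- sample(1:10, 2000, replace = TRUE)
rate  <- 0.35 - 0.02 * score + rnorm(2000, sd = 0.05)

# cor.test() uses Pearson's product-moment correlation by default.
result <- cor.test(score, rate)

result$estimate   # sample correlation coefficient
result$conf.int   # 95 percent confidence interval
result$p.value    # p-value for the null hypothesis of zero correlation
```

Pearson's coefficient measures linear association only, so a jittered scatter plot with a smoother remains a useful companion to the number.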

With that in mind, I shall look at the Return Rate and Investors next.

Initial data exploration - Return of Investment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084

=========================================================================================================

The mean rate of return is 9.6%, with a maximum of 28.4%. However, the rate of return is sometimes negative, with a minimum of -18.3%.

If that’s the case, then are all the loans fully funded? Let’s find out.

I am going to create a new data frame called ReturnInvestment that contains the following variables:

Let’s look at the Investors next.

Investors

##   mean_funded median_funded min_funded max_funded      n
## 1   0.9985835             1        0.7     1.0125 113937
##   Mean_Investor Meidan_Investor min_nvestors max_nvestors      n
## 1      80.47523              44            1         1189 113937

=========================================================================================================

Wow, the result from the summary of Investors code surprises me.

Now I understand why investors are willing to invest in these loans. They share the risk, and they don’t have to put everything into one loan at a time. They can invest small amounts in many loans, so their risk is diversified. They manage their risk the way people manage stock investment portfolios.

Let’s look at sole investors, since I think this is the riskiest area, and see whether it’s worth the risk.

Sole Investor data exploration - Estimated Return and Borrower’s Debt to Income Ratio

=========================================================================================================

I chose this analysis because, as the previous analysis of investors showed, most of them don’t invest solely in one loan but spread small amounts over multiple loans. Their risk is well diversified over their portfolios.

By looking at these sole investors, we can see clearly that, on average, their return on investment is below 10%, yet they take a higher risk than if they had invested the same amount across multiple loans.
The return is not higher, but the risk is.

To conclude this project, I will create scatter plot matrices, setting a seed and subsetting the data to include only the variables that I want to show.

Issue with ggpairs plot and selected variables

========================================================================================================

Since this is such a big data set, I want to select 5000 loans with a loan amount < 10000 where the listing category is debt consolidation. Then I subset my loan data and pick only the variables I think are directly correlated with the investors’ rate of return. However, when plotting with ggpairs, I had so many problems with this single plot. It took more than 12 hours to get just this one working. I carefully used the same code that I had learned and went over the lessons more than 10 times just to figure out why the plot did not work.

Finally, I get it. The lower-left triangle uses grouped histograms for qualitative-qualitative pairs and scatter plots for quantitative-quantitative pairs. The upper triangle groups by the x variable and provides the correlation for quantitative-quantitative pairs.
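The sampling and subsetting steps described above can be sketched as follows. The data frame is simulated as a stand-in for loanData, the seed value is arbitrary, and the names `keep_cols`, `eligible`, and `loan_sample` are introduced for this example only; the ggpairs call is guarded so the sketch still runs where GGally is not installed.

```r
set.seed(2018)  # arbitrary seed so the same rows are drawn on every run

# Simulated stand-in for loanData; the values are not the Prosper data.
n <- 20000
loans <- data.frame(
  ListingCategory    = sample(0:7, n, replace = TRUE),
  LoanOriginalAmount = round(runif(n, 1000, 35000)),
  BorrowerAPR        = runif(n, 0.05, 0.40),
  ProsperScore       = sample(1:10, n, replace = TRUE),
  EstimatedReturn    = runif(n, 0, 0.2)
)

# Keep debt-consolidation loans under 10,000 and only the columns of interest.
keep_cols <- c("LoanOriginalAmount", "BorrowerAPR", "ProsperScore", "EstimatedReturn")
eligible  <- loans[loans$ListingCategory == 1 & loans$LoanOriginalAmount < 10000, keep_cols]

# Draw at most 5000 rows without replacement.
loan_sample <- eligible[sample(nrow(eligible), min(5000, nrow(eligible))), ]
dim(loan_sample)

# Plot only in an interactive session with GGally installed.
if (interactive() && requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggpairs(loan_sample))
}
```

Dropping unused columns before plotting matters because ggpairs draws one panel per pair of columns, so the work grows with the square of the column count. ggpairs also has a `cardinality_threshold` argument (15 by default) that stops it when a factor column has more levels than that, which may be related to the "15 allowed" message mentioned in the reflection.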

To wrap up this project, I want to build an lm model to see whether the predicted rate of return is close to what Prosper states on its website. I picked one of the loans on the website that still required funding and had the following criteria:

I built the lm model with the variables ProsperScore, ListingCategory, LoanOriginalAmount, BorrowerAPR, and MonthlyLoanPayment.

Apply the lm model to a current Prosper investment listing, using the criteria listed.

Evaluate how well the lm model predicts the return on investment stated by Prosper.

It did not work, as it produced the following warning. I couldn’t find anything that addressed this issue, so I decided to use another approach, as seen next.

Warning message: In predict.lm(m5, newdata = This_loan, interval = "prediction", : prediction from a rank-deficient fit may be misleading

Summary of the Result

## 
## Calls:
## m1: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount), data = loanData[loanData$LoanOriginalAmount < 
##     20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn > 
##     0 & loanData$ProsperScore > 7, ])
## m2: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR, 
##     data = loanData[loanData$LoanOriginalAmount < 20000 & loanData$ListingCategory == 
##         1 & loanData$EstimatedReturn > 0 & loanData$ProsperScore > 
##         7, ])
## m3: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR + 
##     EstimatedReturn, data = loanData[loanData$LoanOriginalAmount < 
##     20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn > 
##     0 & loanData$ProsperScore > 7, ])
## m4: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR + 
##     EstimatedReturn + ProsperScore, data = loanData[loanData$LoanOriginalAmount < 
##     20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn > 
##     0 & loanData$ProsperScore > 7, ])
## m5: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR + 
##     EstimatedReturn + ProsperScore + ListingCategory, data = loanData[loanData$LoanOriginalAmount < 
##     20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn > 
##     0 & loanData$ProsperScore > 7, ])
## 
## ====================================================================================================
##                               m1             m2             m3             m4             m5        
## ----------------------------------------------------------------------------------------------------
##   (Intercept)                 0.084***       0.008***       0.008***       0.002          0.002     
##                              (0.001)        (0.000)        (0.000)        (0.001)        (0.001)    
##   I(LoanOriginalAmount)      -0.000***       0.000**        0.000**        0.000***       0.000***  
##                              (0.000)        (0.000)        (0.000)        (0.000)        (0.000)    
##   BorrowerAPR                                0.442***       0.442***       0.447***       0.447***  
##                                             (0.002)        (0.002)        (0.002)        (0.002)    
##   ProsperScore                                                             0.001***       0.001***  
##                                                                           (0.000)        (0.000)    
## ----------------------------------------------------------------------------------------------------
##   R-squared                   0.017          0.780          0.780          0.780          0.780     
##   adj. R-squared              0.017          0.780          0.780          0.780          0.780     
##   sigma                       0.026          0.012          0.012          0.012          0.012     
##   F                         217.519      21725.060      21725.060      14511.305      14511.305     
##   p                           0.000          0.000          0.000          0.000          0.000     
##   Log-likelihood          27155.705      36326.200      36326.200      36335.799      36335.799     
##   Deviance                    8.526          1.908          1.908          1.905          1.905     
##   AIC                    -54305.411     -72644.399     -72644.399     -72661.599     -72661.599     
##   BIC                    -54283.170     -72614.745     -72614.745     -72624.531     -72624.531     
##   N                       12253          12253          12253          12253          12253         
## ====================================================================================================
## [1] 0.01247088
## [1] NA

=========================================================================================================

Using the predictive model produced the following warning:

In predict.lm(m6, newdata = This_loan, interval = "prediction", : prediction from a rank-deficient fit may be misleading

This is another challenge I ran into, and I could not find anything on the web to address this warning. I contacted my mentor but haven’t heard back either. So I decided to try another method, because the predictive model may not work anyway.
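One likely cause of this warning, judging from the model calls printed in the summary above, is that ListingCategory is constant in the fitted subset (every row satisfies ListingCategory == 1), so its coefficient cannot be estimated separately from the intercept. EstimatedReturn also appears on both sides of the formula, and lm() drops a response that appears on the right-hand side, which would explain why m2 and m3 are identical. The sketch below reproduces the constant-predictor case on simulated data; the variable names mirror the report, but the numbers and the `new_loan` listing are invented.

```r
set.seed(7)
n <- 200

# Simulated subset in which ListingCategory takes a single value.
sim <- data.frame(
  BorrowerAPR     = runif(n, 0.05, 0.35),
  ListingCategory = 1
)
sim$EstimatedReturn <- 0.01 + 0.45 * sim$BorrowerAPR + rnorm(n, sd = 0.01)

# The constant column is collinear with the intercept, so lm() cannot
# estimate it and reports NA for that coefficient (a rank-deficient fit).
fit_deficient <- lm(EstimatedReturn ~ BorrowerAPR + ListingCategory, data = sim)
coef(fit_deficient)

# Dropping the constant predictor gives a full-rank fit.
fit_full <- lm(EstimatedReturn ~ BorrowerAPR, data = sim)

# Prediction interval for a hypothetical new listing.
new_loan <- data.frame(BorrowerAPR = 0.20)
pred <- predict(fit_full, newdata = new_loan, interval = "prediction")
pred
```

Checking `coef()` for NA values, or comparing `fit$rank` with the number of coefficients, is a quick way to spot this situation before calling predict().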

The standard deviation summary is what I used. The plot shows the standard deviation of the residuals across the estimated return rate, and we can see that the residual variability is not constant.

Investors are attracted to these investments because their risk is well diversified. They also get their investment back on a monthly basis. They could recover their cost first (because early payments go mostly to interest) before the borrowers could default on their loans. Prosper has a very smart business model, which helps both investors and borrowers at the same time.

Final 3 Plots Summary

=========================================================================================================

I chose 3 plots that represent the points of view of the investor, the borrower, and Prosper itself.

Reflection

When I chose this project, I did not know what to expect, as I did not know anything about Prosper. As I worked on it, I discovered unusual things, such as the absence of activity in 2008-2009. So I did some research and learned a great deal about the company. With this knowledge, I was able to analyze its loans more intelligently.

Sometimes I put on a borrower’s hat to understand their point of view: why are they willing to pay higher rates than on their current credit cards?
(This is one of my assumptions, as I didn’t know of any credit cards with rates > 20%.) And why do they think going with Prosper will benefit them?

Sometimes I looked at the data from an investor’s point of view. With borrowers carrying such high credit card debt and low credit scores, what makes a Prosper investment attractive? These questions were answered as I got deeper into the analysis, and the answers surprised me quite a bit. The results were sometimes not what I expected, as already mentioned and explained in each phase of the analysis.

I ran into many challenges with the code. Most of the time, I could resolve them by looking things up, going back to the class lessons, or just trying different approaches.
It takes time to correct these issues, but I also learned new things from them.

For example, the code below always resulted in an error because it did not recognize the select() function. I checked many times, refreshed, and reviewed my notes and lectures, but nothing worked. Then I searched Google and found this solution: after running ls("package:MASS") to inspect the package and adding dplyr:: before select(), it worked fine.
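The name clash behind this fix can be illustrated with a short sketch. MASS exports its own select() function, so when MASS is attached after dplyr, an unqualified select() call resolves to the MASS version. The data frame below is invented, and this is an illustration of the workaround rather than the original failing code.

```r
library(MASS)   # ships with R; attaching it puts MASS::select on the search path

# MASS exports its own select(), which can mask dplyr::select()
# when MASS is attached after dplyr.
mass_has_select <- "select" %in% ls("package:MASS")
mass_has_select

# Invented example data.
loans <- data.frame(LoanOriginalAmount = c(4000, 9500),
                    BorrowerAPR        = c(0.21, 0.15),
                    Term               = c(36, 60))

# Qualifying the call with the namespace removes the ambiguity.
if (requireNamespace("dplyr", quietly = TRUE)) {
  picked <- dplyr::select(loans, LoanOriginalAmount, BorrowerAPR)
  names(picked)
}
```

Calling `conflicts()` after loading packages lists every name that is defined in more than one attached package, which helps catch this kind of masking early.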

I also had a huge problem running ggpairs at the very end of the project. I spent over 12 hours on that code alone. I know I did not spend my time on the project wisely; I should have moved on and used another plot to get the project done. However, I could not let go because I wanted to understand why it did not work. I followed the instructions closely and checked every single character of my code. I used set.seed(), sampled from the data, and subset it. It kept saying that something was much larger than the 15 (?) allowed. But I got it working in the end, and I am so happy about it.

In summary, I am glad that I chose this data set, first of all because I don’t drink and don’t know anything about wine. Secondly, the wine project did not have many variables to analyze. The Prosper loan data is interesting, and I think I could spend days analyzing it further. There are so many variables that I did not even touch. They are all good variables with information that investors could use to analyze their investment of choice. My curiosity kicked in as I worked on it further and deeper. Sometimes I felt like I was repeating myself, but actually I was not. You must look at the data from a broad perspective first, and then, as you peel the onion, you discover things that you did not expect to see. That’s exactly how I felt working on this project.

=========================================================================================================